AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.
You as a Data scientist at AllLife bank have to build a model that will help the marketing department to identify the potential customers who have a higher probability of purchasing the loan.
The data contains characteristics of the people
# this will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Library to split data
from sklearn.model_selection import train_test_split
# libaries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.linear_model import LogisticRegression
# Libraries to build decision tree classifier
from sklearn import tree
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
# To get diferent metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
plot_confusion_matrix,
precision_recall_curve,
roc_curve,
)
data = pd.read_csv("Loan_Modelling.csv")
data.head()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
data.tail()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
np.random.seed(1) # to get the same random results every time
data.sample(n=10)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2764 | 2765 | 31 | 5 | 84 | 91320 | 1 | 2.9 | 3 | 105 | 0 | 0 | 0 | 0 | 1 |
| 4767 | 4768 | 35 | 9 | 45 | 90639 | 3 | 0.9 | 1 | 101 | 0 | 1 | 0 | 0 | 0 |
| 3814 | 3815 | 34 | 9 | 35 | 94304 | 3 | 1.3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3499 | 3500 | 49 | 23 | 114 | 94550 | 1 | 0.3 | 1 | 286 | 0 | 0 | 0 | 1 | 0 |
| 2735 | 2736 | 36 | 12 | 70 | 92131 | 3 | 2.6 | 2 | 165 | 0 | 0 | 0 | 1 | 0 |
| 3922 | 3923 | 31 | 4 | 20 | 95616 | 4 | 1.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2701 | 2702 | 50 | 26 | 55 | 94305 | 1 | 1.6 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1179 | 1180 | 36 | 11 | 98 | 90291 | 3 | 1.2 | 3 | 0 | 0 | 1 | 0 | 0 | 1 |
| 932 | 933 | 51 | 27 | 112 | 94720 | 3 | 1.8 | 2 | 0 | 0 | 1 | 1 | 1 | 1 |
| 792 | 793 | 41 | 16 | 98 | 93117 | 1 | 4.0 | 3 | 0 | 0 | 0 | 0 | 0 | 1 |
data.shape
(5000, 14)
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 5000 non-null int64 1 Age 5000 non-null int64 2 Experience 5000 non-null int64 3 Income 5000 non-null int64 4 ZIPCode 5000 non-null int64 5 Family 5000 non-null int64 6 CCAvg 5000 non-null float64 7 Education 5000 non-null int64 8 Mortgage 5000 non-null int64 9 Personal_Loan 5000 non-null int64 10 Securities_Account 5000 non-null int64 11 CD_Account 5000 non-null int64 12 Online 5000 non-null int64 13 CreditCard 5000 non-null int64 dtypes: float64(1), int64(13) memory usage: 547.0 KB
data.duplicated().sum()
0
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ID | 5000.0 | 2500.500000 | 1443.520003 | 1.0 | 1250.75 | 2500.5 | 3750.25 | 5000.0 |
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.00 | 45.0 | 55.00 | 67.0 |
| Experience | 5000.0 | 20.104600 | 11.467954 | -3.0 | 10.00 | 20.0 | 30.00 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.00 | 64.0 | 98.00 | 224.0 |
| ZIPCode | 5000.0 | 93169.257000 | 1759.455086 | 90005.0 | 91911.00 | 93437.0 | 94608.00 | 96651.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.00 | 2.0 | 3.00 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.70 | 1.5 | 2.50 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.00 | 2.0 | 3.00 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.00 | 0.0 | 101.00 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
Age: Average age of people in the dataset is 45 years, age ranges from 23 to 67 years.Income: The average income in dollars is 73000. There's a large difference between the maximum value and 75th percentile which indicates that there might be outliers present in this variable.CC_Avg: There's a huge difference in the 75th percentile and maximum value of CC_Avg indicating the presence of outliers. Also, the minimum observations are 0.Experience: The minimum value is -3 years. Years of experience should not be in negative values.Explore Experience column further
data.sort_values(by=["Experience"], ascending=True).head(5)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4514 | 4515 | 24 | -3 | 41 | 91768 | 4 | 1.0 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2618 | 2619 | 23 | -3 | 55 | 92704 | 3 | 2.4 | 2 | 145 | 0 | 0 | 0 | 1 | 0 |
| 4285 | 4286 | 23 | -3 | 149 | 93555 | 2 | 7.2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3626 | 3627 | 24 | -3 | 28 | 90089 | 4 | 1.0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3796 | 3797 | 24 | -2 | 50 | 94920 | 3 | 2.4 | 2 | 0 | 0 | 1 | 0 | 0 | 0 |
We will attempt to fix the negative values in the Experience column
# function to return the absolute values for all values in *Experience* column
data["Experience"] = data["Experience"].apply(lambda x: abs(x))
# import python package to map zipcodes to different locations
from uszipcode import SearchEngine
search = SearchEngine(simple_zipcode=True)
# get zipcode county locations
zip_list = data.ZIPCode.tolist()
zip_data = dict()
for z in zip_list:
zipcode = search.by_zipcode(z)
zip_dic = zipcode.to_dict()
zip_city = zip_dic["county"]
zip_data[z] = zip_city
# Map to Zipcode column
data["County"] = data["ZIPCode"].map(zip_data)
data["County"].value_counts(dropna=False)
Los Angeles County 1095 San Diego County 568 Santa Clara County 563 Alameda County 500 Orange County 339 San Francisco County 257 San Mateo County 204 Sacramento County 184 Santa Barbara County 154 Yolo County 130 Monterey County 128 Ventura County 114 San Bernardino County 101 Contra Costa County 85 Santa Cruz County 68 Riverside County 56 Kern County 54 Marin County 54 NaN 34 Solano County 33 San Luis Obispo County 33 Humboldt County 32 Sonoma County 28 Fresno County 26 Placer County 24 Butte County 19 Shasta County 18 El Dorado County 17 Stanislaus County 15 San Benito County 14 San Joaquin County 13 Mendocino County 8 Tuolumne County 7 Siskiyou County 7 Merced County 4 Trinity County 4 Lake County 4 Napa County 3 Imperial County 3 Name: County, dtype: int64
There are still some values that do not belong to a California county. Let's replace nan values with the string unknown
data["County"].fillna("Unknown", inplace=True)
cat_group = [
"County",
"Family",
"Education",
"Personal_Loan",
"Securities_Account",
"CD_Account",
"Online",
"CreditCard",
]
# convert to category
for cat in cat_group:
data[cat] = data[cat].astype("category")
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 5000 non-null int64 1 Age 5000 non-null int64 2 Experience 5000 non-null int64 3 Income 5000 non-null int64 4 ZIPCode 5000 non-null int64 5 Family 5000 non-null category 6 CCAvg 5000 non-null float64 7 Education 5000 non-null category 8 Mortgage 5000 non-null int64 9 Personal_Loan 5000 non-null category 10 Securities_Account 5000 non-null category 11 CD_Account 5000 non-null category 12 Online 5000 non-null category 13 CreditCard 5000 non-null category 14 County 5000 non-null category dtypes: category(8), float64(1), int64(6) memory usage: 314.9 KB
# copying data to another variable to avoid any changes to original data
df = data.copy()
Lets us look at different levels in categorical variables
# filtering object type columns
cat_col = data.describe(include=["category"]).columns
cat_col
Index(['Family', 'Education', 'Personal_Loan', 'Securities_Account',
'CD_Account', 'Online', 'CreditCard', 'County'],
dtype='object')
for i in cat_col:
print("Unique values in", i, "are :")
print(data[i].value_counts())
print("*" * 50)
Unique values in Family are : 1 1472 2 1296 4 1222 3 1010 Name: Family, dtype: int64 ************************************************** Unique values in Education are : 1 2096 3 1501 2 1403 Name: Education, dtype: int64 ************************************************** Unique values in Personal_Loan are : 0 4520 1 480 Name: Personal_Loan, dtype: int64 ************************************************** Unique values in Securities_Account are : 0 4478 1 522 Name: Securities_Account, dtype: int64 ************************************************** Unique values in CD_Account are : 0 4698 1 302 Name: CD_Account, dtype: int64 ************************************************** Unique values in Online are : 1 2984 0 2016 Name: Online, dtype: int64 ************************************************** Unique values in CreditCard are : 0 3530 1 1470 Name: CreditCard, dtype: int64 ************************************************** Unique values in County are : Los Angeles County 1095 San Diego County 568 Santa Clara County 563 Alameda County 500 Orange County 339 San Francisco County 257 San Mateo County 204 Sacramento County 184 Santa Barbara County 154 Yolo County 130 Monterey County 128 Ventura County 114 San Bernardino County 101 Contra Costa County 85 Santa Cruz County 68 Riverside County 56 Kern County 54 Marin County 54 Unknown 34 San Luis Obispo County 33 Solano County 33 Humboldt County 32 Sonoma County 28 Fresno County 26 Placer County 24 Butte County 19 Shasta County 18 El Dorado County 17 Stanislaus County 15 San Benito County 14 San Joaquin County 13 Mendocino County 8 Tuolumne County 7 Siskiyou County 7 Lake County 4 Merced County 4 Trinity County 4 Imperial County 3 Napa County 3 Name: County, dtype: int64 **************************************************
def histogram_boxplot(data, feature, figsize=(12, 10), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
histogram_boxplot(data, "Age")
histogram_boxplot(data, "Experience")
histogram_boxplot(data, "Income")
histogram_boxplot(data, "CCAvg")
histogram_boxplot(data, "Mortgage")
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
labeled_barplot(data, "Family", perc=True)
labeled_barplot(data, "Education", perc=True)
labeled_barplot(data, "Securities_Account", perc=True)
labeled_barplot(data, "CD_Account", perc=True)
labeled_barplot(data, "Online", perc=True)
labeled_barplot(data, "CreditCard", perc=True)
plt.figure(figsize=(12, 18))
sns.countplot(y="County", data=data)
<AxesSubplot:xlabel='count', ylabel='County'>
labeled_barplot(data, "Personal_Loan", perc=True)
sns.pairplot(data=data.select_dtypes(include=np.number))
<seaborn.axisgrid.PairGrid at 0x350da0e3a0>
plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
### Function to plot stacked bar charts for categorical columns
def stacked_plot(x):
sns.set()
## crosstab
tab1 = pd.crosstab(x, data["Personal_Loan"], margins=True).sort_values(
by=1, ascending=False
)
print(tab1)
print("-" * 120)
## visualising the cross tab
tab = pd.crosstab(x, data["Personal_Loan"], normalize="index").sort_values(
by=1, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(17, 7))
plt.legend(
loc="lower left",
frameon=False,
)
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
plt.legend(
loc="lower left",
frameon=False,
)
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
stacked_barplot(data, "Family", "Personal_Loan")
Personal_Loan 0 1 All Family All 4520 480 5000 4 1088 134 1222 3 877 133 1010 1 1365 107 1472 2 1190 106 1296 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Education", "Personal_Loan")
Personal_Loan 0 1 All Education All 4520 480 5000 3 1296 205 1501 2 1221 182 1403 1 2003 93 2096 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Securities_Account", "Personal_Loan")
Personal_Loan 0 1 All Securities_Account All 4520 480 5000 0 4058 420 4478 1 462 60 522 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "CD_Account", "Personal_Loan")
Personal_Loan 0 1 All CD_Account All 4520 480 5000 0 4358 340 4698 1 162 140 302 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Online", "Personal_Loan")
Personal_Loan 0 1 All Online All 4520 480 5000 1 2693 291 2984 0 1827 189 2016 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "CreditCard", "Personal_Loan")
Personal_Loan 0 1 All CreditCard All 4520 480 5000 0 3193 337 3530 1 1327 143 1470 ------------------------------------------------------------------------------------------------------------------------
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
stat="density",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
stat="density",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 0],
palette="gist_rainbow",
showmeans=True,
)
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
showmeans=True,
)
plt.tight_layout()
plt.show()
distribution_plot_wrt_target(data, "Age", "Personal_Loan")
distribution_plot_wrt_target(data, "Experience", "Personal_Loan")
distribution_plot_wrt_target(data, "Income", "Personal_Loan")
distribution_plot_wrt_target(data, "CCAvg", "Personal_Loan")
plt.figure(figsize=(15, 12))
sns.boxplot(
y="Education",
x="Income",
data=data,
hue="Personal_Loan",
showfliers=False,
showmeans=True,
)
plt.show()
Creating training and test sets.
# drop ID and ZIPCode column from model
X = data.drop(["Personal_Loan", "ID", "ZIPCode"], axis=1)
Y = data["Personal_Loan"]
X = pd.get_dummies(X, drop_first=True)
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set : (3500, 52) Shape of test set : (1500, 52) Percentage of classes in training set: 0 0.905429 1 0.094571 Name: Personal_Loan, dtype: float64 Percentage of classes in test set: 0 0.900667 1 0.099333 Name: Personal_Loan, dtype: float64
Both the cases are important as:
If we predict a customer is likely to purchase a loan but in reality the customer did not, marketing campaigns and resources will be wasted.
If we predict a person isn't likely to purchase a loan but in reality the customer did, we would lose a potential customer and this would negatively affect our success ratio.
Recall should be maximized, the greater the recall the higher the chances of increasing the success ratio.# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
model, predictors, target, threshold=0.5
):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
# checking which probabilities are greater than threshold
pred_temp = model.predict(predictors) > threshold
# rounding off the above values to get classes
pred = np.round(pred_temp)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1,
},
index=[0],
)
return df_perf
# defining a function to plot the confusion_matrix of a classification model
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
y_pred = model.predict(predictors) > threshold
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
# There are different solvers available in Sklearn logistic regression
# The newton-cg solver is faster for high-dimensional data
lg = LogisticRegression(solver="newton-cg", random_state=1)
model = lg.fit(X_train, y_train)
# predicting on training set
y_pred_train = lg.predict(X_train)
print("Training set performance:")
print("Accuracy:", accuracy_score(y_train, y_pred_train))
print("Precision:", precision_score(y_train, y_pred_train))
print("Recall:", recall_score(y_train, y_pred_train))
print("F1:", f1_score(y_train, y_pred_train))
Training set performance: Accuracy: 0.9625714285714285 Precision: 0.890625 Recall: 0.6888217522658611 F1: 0.776831345826235
# predicting on the test set
y_pred_test = lg.predict(X_test)
print("Test set performance:")
print("Accuracy:", accuracy_score(y_test, y_pred_test))
print("Precision:", precision_score(y_test, y_pred_test))
print("Recall:", recall_score(y_test, y_pred_test))
print("F1:", f1_score(y_test, y_pred_test))
Test set performance: Accuracy: 0.954 Precision: 0.9081632653061225 Recall: 0.5973154362416108 F1: 0.7206477732793521
Observations
The training and testing recall scores are 0.69 and 0.60 respectively.
We will look to improve our model and optimize for recall.
We will now perform logistic regression using statsmodels, a Python module that provides functions for the estimation of many statistical models, as well as for conducting statistical tests, and statistical data exploration.
Using statsmodels, we will be able to check the statistical validity of our model - identify the significant predictors from p-values that we get for each predictor variable.
X = data.drop(["Personal_Loan", "ID", "ZIPCode"], axis=1)
Y = data["Personal_Loan"]
X = pd.get_dummies(X, drop_first=True)
# adding constant
X = sm.add_constant(X)
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
)
# fitting logistic regression model
logit = sm.Logit(y_train, X_train.astype(float))
lg = logit.fit(disp=False)
print(lg.summary())
Logit Regression Results
==============================================================================
Dep. Variable: Personal_Loan No. Observations: 3500
Model: Logit Df Residuals: 3447
Method: MLE Df Model: 52
Date: Fri, 30 Jul 2021 Pseudo R-squ.: 0.6694
Time: 23:14:30 Log-Likelihood: -362.13
converged: False LL-Null: -1095.5
Covariance Type: nonrobust LLR p-value: 9.412e-273
=================================================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------------------
const -13.8482 2.302 -6.015 0.000 -18.361 -9.336
Age 0.0009 0.084 0.010 0.992 -0.163 0.165
Experience 0.0039 0.084 0.046 0.963 -0.160 0.168
Income 0.0647 0.004 16.051 0.000 0.057 0.073
CCAvg 0.2475 0.061 4.025 0.000 0.127 0.368
Mortgage 0.0010 0.001 1.261 0.207 -0.001 0.003
Family_2 0.0891 0.304 0.293 0.770 -0.507 0.685
Family_3 2.6546 0.332 8.007 0.000 2.005 3.304
Family_4 1.7441 0.322 5.411 0.000 1.112 2.376
Education_2 4.0518 0.357 11.348 0.000 3.352 4.752
Education_3 4.3296 0.357 12.116 0.000 3.629 5.030
Securities_Account_1 -1.0757 0.420 -2.562 0.010 -1.899 -0.253
CD_Account_1 3.8955 0.459 8.494 0.000 2.997 4.794
Online_1 -0.6562 0.214 -3.068 0.002 -1.075 -0.237
CreditCard_1 -1.1389 0.285 -3.990 0.000 -1.698 -0.579
County_Butte County -21.6373 1.54e+05 -0.000 1.000 -3.02e+05 3.02e+05
County_Contra Costa County 0.3412 0.848 0.403 0.687 -1.320 2.002
County_El Dorado County -0.7120 1.438 -0.495 0.621 -3.531 2.107
County_Fresno County -1.0425 2.297 -0.454 0.650 -5.544 3.459
County_Humboldt County -1.1850 1.789 -0.662 0.508 -4.691 2.321
County_Imperial County -14.9153 2.75e+04 -0.001 1.000 -5.4e+04 5.4e+04
County_Kern County 1.5868 0.790 2.008 0.045 0.038 3.136
County_Lake County -15.5570 2.3e+04 -0.001 0.999 -4.51e+04 4.51e+04
County_Los Angeles County 0.0587 0.384 0.153 0.878 -0.693 0.810
County_Marin County 0.4467 0.918 0.487 0.627 -1.353 2.246
County_Mendocino County -2.2792 4.574 -0.498 0.618 -11.245 6.686
County_Merced County -12.3470 1174.727 -0.011 0.992 -2314.769 2290.075
County_Monterey County -0.0422 0.705 -0.060 0.952 -1.424 1.340
County_Napa County -9.0100 1700.622 -0.005 0.996 -3342.168 3324.148
County_Orange County 0.2166 0.500 0.434 0.665 -0.763 1.196
County_Placer County 1.1573 1.023 1.131 0.258 -0.848 3.163
County_Riverside County 2.2206 0.830 2.674 0.007 0.593 3.848
County_Sacramento County 0.1477 0.604 0.245 0.807 -1.036 1.332
County_San Benito County -15.3850 7086.848 -0.002 0.998 -1.39e+04 1.39e+04
County_San Bernardino County -0.9709 1.121 -0.866 0.386 -3.167 1.225
County_San Diego County 0.1617 0.437 0.370 0.712 -0.696 1.019
County_San Francisco County 0.2838 0.540 0.525 0.599 -0.775 1.343
County_San Joaquin County -0.1762 7.310 -0.024 0.981 -14.503 14.150
County_San Luis Obispo County -1.5316 2.219 -0.690 0.490 -5.882 2.818
County_San Mateo County -1.2440 0.669 -1.859 0.063 -2.556 0.068
County_Santa Barbara County 0.5210 0.642 0.812 0.417 -0.736 1.779
County_Santa Clara County 0.2868 0.433 0.663 0.507 -0.561 1.135
County_Santa Cruz County 0.0526 0.859 0.061 0.951 -1.631 1.736
County_Shasta County -4.4260 10.008 -0.442 0.658 -24.041 15.189
County_Siskiyou County -41.7980 5.39e+09 -7.75e-09 1.000 -1.06e+10 1.06e+10
County_Solano County 1.0975 1.046 1.050 0.294 -0.952 3.147
County_Sonoma County 1.3798 1.159 1.191 0.234 -0.891 3.651
County_Stanislaus County -14.9944 1935.987 -0.008 0.994 -3809.459 3779.471
County_Trinity County -13.5624 2566.088 -0.005 0.996 -5043.003 5015.878
County_Tuolumne County -21.4353 1.58e+05 -0.000 1.000 -3.1e+05 3.1e+05
County_Unknown 0.7703 1.076 0.716 0.474 -1.340 2.880
County_Ventura County 0.1638 0.653 0.251 0.802 -1.116 1.444
County_Yolo County -0.4235 0.772 -0.548 0.583 -1.937 1.090
=================================================================================================
Possibly complete quasi-separation: A fraction 0.17 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
print("Training performance:")
model_performance_classification_statsmodels(lg, X_train, y_train)
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.964571 | 0.712991 | 0.890566 | 0.791946 |
Observations
Negative values of the coefficient shows that probability of customer purchasing a loan decreases with the increase of corresponding attribute value.
Positive values of the coefficient show that that probability of customer purchasing a loan increases with the increase of corresponding attribute value.
p-value of a variable indicates if the variable is significant or not. If we consider the significance level to be 0.05 (5%), then any variable with a p-value less than 0.05 would be considered significant.
But these variables might contain multicollinearity, which will affect the p-values.
We will have to remove multicollinearity from the data to get reliable coefficients and p-values.
There are different ways of detecting (or testing) multi-collinearity, one such way is the Variation Inflation Factor.
Variance Inflation factor: Variance inflation factors measure the inflation in the variances of the regression coefficients estimates due to collinearities that exist among the predictors. It is a measure of how much the variance of the estimated regression coefficient βk is “inflated”by the existence of correlation among the predictor variables in the model.
General Rule of thumb: If VIF is 1 then there is no correlation among the kth predictor and the remaining predictor variables, and hence the variance of β̂k is not inflated at all. Whereas if VIF exceeds 5, we say there is moderate VIF and if it is 10 or exceeding 10, it shows signs of high multi-collinearity. But the purpose of the analysis should dictate which threshold to use.
vif_series = pd.Series(
[variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
index=X_train.columns,
dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection: const 461.006340 Age 92.482526 Experience 92.377065 Income 1.917875 CCAvg 1.760991 Mortgage 1.061009 Family_2 1.420952 Family_3 1.399521 Family_4 1.442466 Education_2 1.316474 Education_3 1.341745 Securities_Account_1 1.160051 CD_Account_1 1.376503 Online_1 1.056084 CreditCard_1 1.126089 County_Butte County 1.036412 County_Contra Costa County 1.142204 County_El Dorado County 1.030292 County_Fresno County 1.034880 County_Humboldt County 1.064271 County_Imperial County 1.007742 County_Kern County 1.094882 County_Lake County 1.014048 County_Los Angeles County 2.430983 County_Marin County 1.093672 County_Mendocino County 1.025040 County_Merced County 1.010499 County_Monterey County 1.218541 County_Napa County 1.010684 County_Orange County 1.562459 County_Placer County 1.046803 County_Riverside County 1.089916 County_Sacramento County 1.327556 County_San Benito County 1.030201 County_San Bernardino County 1.182095 County_San Diego County 1.821752 County_San Francisco County 1.439310 County_San Joaquin County 1.016484 County_San Luis Obispo County 1.053886 County_San Mateo County 1.342339 County_Santa Barbara County 1.246687 County_Santa Clara County 1.840770 County_Santa Cruz County 1.127757 County_Shasta County 1.021879 County_Siskiyou County 1.015372 County_Solano County 1.068641 County_Sonoma County 1.062670 County_Stanislaus County 1.027832 County_Trinity County 1.010589 County_Tuolumne County 1.012740 County_Unknown 1.068246 County_Ventura County 1.194778 County_Yolo County 1.210197 dtype: float64
X_train1 = X_train.drop(["Age"], axis=1)
vif_series2 = pd.Series(
[variance_inflation_factor(X_train1.values, i) for i in range(X_train1.shape[1])],
index=X_train1.columns,
)
print("Series before feature selection: \n\n{}\n".format(vif_series2))
Series before feature selection: const 23.652949 Experience 1.025191 Income 1.913432 CCAvg 1.756393 Mortgage 1.060982 Family_2 1.420399 Family_3 1.396525 Family_4 1.442466 Education_2 1.304204 Education_3 1.264376 Securities_Account_1 1.159693 CD_Account_1 1.375711 Online_1 1.056000 CreditCard_1 1.126081 County_Butte County 1.036132 County_Contra Costa County 1.142111 County_El Dorado County 1.030186 County_Fresno County 1.034880 County_Humboldt County 1.063616 County_Imperial County 1.007093 County_Kern County 1.094804 County_Lake County 1.013812 County_Los Angeles County 2.430427 County_Marin County 1.092765 County_Mendocino County 1.025029 County_Merced County 1.010210 County_Monterey County 1.218539 County_Napa County 1.010605 County_Orange County 1.559542 County_Placer County 1.046714 County_Riverside County 1.089854 County_Sacramento County 1.327411 County_San Benito County 1.030144 County_San Bernardino County 1.181982 County_San Diego County 1.820576 County_San Francisco County 1.439278 County_San Joaquin County 1.016454 County_San Luis Obispo County 1.053022 County_San Mateo County 1.342264 County_Santa Barbara County 1.246432 County_Santa Clara County 1.840647 County_Santa Cruz County 1.126966 County_Shasta County 1.021828 County_Siskiyou County 1.014885 County_Solano County 1.068631 County_Sonoma County 1.062669 County_Stanislaus County 1.027830 County_Trinity County 1.010417 County_Tuolumne County 1.012589 County_Unknown 1.067740 County_Ventura County 1.194689 County_Yolo County 1.210151 dtype: float64
logit2 = sm.Logit(y_train, X_train1.astype(float))
lg2 = logit2.fit()
print("Training performance:")
model_performance_classification_statsmodels(lg2, X_train1, y_train)
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.103465
Iterations: 35
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.964571 | 0.712991 | 0.890566 | 0.791946 |
X_train2 = X_train.drop(["Experience", "Age"], axis=1)
vif_series3 = pd.Series(
[variance_inflation_factor(X_train2.values, i) for i in range(X_train2.shape[1])],
index=X_train2.columns,
)
print("Series before feature selection: \n\n{}\n".format(vif_series3))
Series before feature selection: const 19.950403 Income 1.911379 CCAvg 1.753764 Mortgage 1.060903 Family_2 1.420357 Family_3 1.396496 Family_4 1.436959 Education_2 1.302822 Education_3 1.264357 Securities_Account_1 1.159522 CD_Account_1 1.374398 Online_1 1.055975 CreditCard_1 1.125960 County_Butte County 1.035459 County_Contra Costa County 1.141893 County_El Dorado County 1.030177 County_Fresno County 1.034869 County_Humboldt County 1.063608 County_Imperial County 1.006988 County_Kern County 1.094804 County_Lake County 1.013733 County_Los Angeles County 2.428366 County_Marin County 1.092178 County_Mendocino County 1.024962 County_Merced County 1.009914 County_Monterey County 1.218245 County_Napa County 1.010364 County_Orange County 1.559488 County_Placer County 1.046713 County_Riverside County 1.089823 County_Sacramento County 1.327384 County_San Benito County 1.027997 County_San Bernardino County 1.181490 County_San Diego County 1.820564 County_San Francisco County 1.438490 County_San Joaquin County 1.016415 County_San Luis Obispo County 1.052886 County_San Mateo County 1.342009 County_Santa Barbara County 1.246292 County_Santa Clara County 1.840406 County_Santa Cruz County 1.126836 County_Shasta County 1.021739 County_Siskiyou County 1.014885 County_Solano County 1.068532 County_Sonoma County 1.060918 County_Stanislaus County 1.027353 County_Trinity County 1.010400 County_Tuolumne County 1.012589 County_Unknown 1.067630 County_Ventura County 1.193087 County_Yolo County 1.210149 dtype: float64
logit3 = sm.Logit(y_train, X_train2.astype(float))
lg3 = logit3.fit()
print("Training performance:")
model_performance_classification_statsmodels(lg3, X_train2, y_train)
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.103505
Iterations: 35
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.964571 | 0.712991 | 0.890566 | 0.791946 |
print(lg3.summary())
Logit Regression Results
==============================================================================
Dep. Variable: Personal_Loan No. Observations: 3500
Model: Logit Df Residuals: 3449
Method: MLE Df Model: 50
Date: Fri, 30 Jul 2021 Pseudo R-squ.: 0.6693
Time: 23:15:01 Log-Likelihood: -362.27
converged: False LL-Null: -1095.5
Covariance Type: nonrobust LLR p-value: 3.673e-274
=================================================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------------------
const -13.7214 0.824 -16.655 0.000 -15.336 -12.107
Income 0.0647 0.004 16.096 0.000 0.057 0.073
CCAvg 0.2441 0.061 3.989 0.000 0.124 0.364
Mortgage 0.0010 0.001 1.234 0.217 -0.001 0.002
Family_2 0.0884 0.304 0.291 0.771 -0.507 0.684
Family_3 2.6536 0.332 7.999 0.000 2.003 3.304
Family_4 1.7411 0.322 5.401 0.000 1.109 2.373
Education_2 4.0480 0.356 11.356 0.000 3.349 4.747
Education_3 4.3260 0.355 12.175 0.000 3.630 5.022
Securities_Account_1 -1.0805 0.421 -2.569 0.010 -1.905 -0.256
CD_Account_1 3.9083 0.459 8.524 0.000 3.010 4.807
Online_1 -0.6557 0.214 -3.067 0.002 -1.075 -0.237
CreditCard_1 -1.1416 0.285 -4.000 0.000 -1.701 -0.582
County_Butte County -22.3052 2.2e+05 -0.000 1.000 -4.3e+05 4.3e+05
County_Contra Costa County 0.3297 0.852 0.387 0.699 -1.340 2.000
County_El Dorado County -0.7317 1.446 -0.506 0.613 -3.566 2.102
County_Fresno County -1.0203 2.272 -0.449 0.653 -5.474 3.433
County_Humboldt County -1.1728 1.792 -0.654 0.513 -4.686 2.340
County_Imperial County -14.9309 2.82e+04 -0.001 1.000 -5.52e+04 5.52e+04
County_Kern County 1.5788 0.789 2.000 0.045 0.032 3.126
County_Lake County -14.3689 1.26e+04 -0.001 0.999 -2.47e+04 2.47e+04
County_Los Angeles County 0.0615 0.383 0.161 0.872 -0.689 0.812
County_Marin County 0.4643 0.920 0.505 0.614 -1.339 2.268
County_Mendocino County -2.2490 4.559 -0.493 0.622 -11.184 6.687
County_Merced County -52.6827 6.55e+11 -8.04e-11 1.000 -1.28e+12 1.28e+12
County_Monterey County -0.0289 0.702 -0.041 0.967 -1.406 1.348
County_Napa County -10.3125 3177.487 -0.003 0.997 -6238.073 6217.448
County_Orange County 0.2208 0.499 0.442 0.658 -0.758 1.199
County_Placer County 1.1442 1.024 1.118 0.264 -0.862 3.150
County_Riverside County 2.2323 0.829 2.694 0.007 0.608 3.857
County_Sacramento County 0.1222 0.601 0.203 0.839 -1.056 1.300
County_San Benito County -14.5543 4570.237 -0.003 0.997 -8972.054 8942.945
County_San Bernardino County -0.9582 1.112 -0.861 0.389 -3.138 1.222
County_San Diego County 0.1524 0.437 0.349 0.727 -0.704 1.009
County_San Francisco County 0.2942 0.539 0.546 0.585 -0.762 1.351
County_San Joaquin County -0.1669 7.233 -0.023 0.982 -14.344 14.010
County_San Luis Obispo County -1.5170 2.166 -0.700 0.484 -5.762 2.728
County_San Mateo County -1.2596 0.667 -1.889 0.059 -2.566 0.047
County_Santa Barbara County 0.5180 0.642 0.807 0.420 -0.741 1.777
County_Santa Clara County 0.2865 0.432 0.664 0.507 -0.559 1.132
County_Santa Cruz County 0.0708 0.856 0.083 0.934 -1.607 1.748
County_Shasta County -4.3921 9.888 -0.444 0.657 -23.773 14.989
County_Siskiyou County -37.6376 6.62e+08 -5.69e-08 1.000 -1.3e+09 1.3e+09
County_Solano County 1.1190 1.042 1.074 0.283 -0.924 3.162
County_Sonoma County 1.3880 1.157 1.200 0.230 -0.879 3.655
County_Stanislaus County -52.2036 2.4e+11 -2.17e-10 1.000 -4.71e+11 4.71e+11
County_Trinity County -16.0553 9122.075 -0.002 0.999 -1.79e+04 1.79e+04
County_Tuolumne County -21.2754 1.43e+05 -0.000 1.000 -2.8e+05 2.8e+05
County_Unknown 0.7625 1.076 0.709 0.479 -1.347 2.872
County_Ventura County 0.1742 0.652 0.267 0.790 -1.105 1.453
County_Yolo County -0.3866 0.764 -0.506 0.613 -1.883 1.110
=================================================================================================
Possibly complete quasi-separation: A fraction 0.17 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
For other attributes present in the data, the p-values are high only for few dummy variables and since only one (or some) of the categorical levels have a high p-value we will drop them iteratively as sometimes p-values change after dropping a variable. So, we'll not drop all variables at once.
Instead, we will do the following repeatedly using a loop:
Note: The above process can also be done manually by picking one variable at a time that has a high p-value, dropping it, and building a model again. But that might be a little tedious and using a loop will be more efficient.
# running a loop to drop variables with high p-value
# initial list of columns
cols = X_train2.columns.tolist()
# setting an initial max p-value
max_p_value = 1
while len(cols) > 0:
# defining the train set
X_train_aux = X_train2[cols]
# fitting the model
model = sm.Logit(y_train, X_train_aux).fit(disp=False)
# getting the p-values and the maximum p-value
p_values = model.pvalues
max_p_value = max(p_values)
# name of the variable with maximum p-value
feature_with_p_max = p_values.idxmax()
if max_p_value > 0.05:
cols.remove(feature_with_p_max)
else:
break
selected_features = cols
print(selected_features)
['const', 'Income', 'CCAvg', 'Family_3', 'Family_4', 'Education_2', 'Education_3', 'Securities_Account_1', 'CD_Account_1', 'Online_1', 'CreditCard_1', 'County_Kern County', 'County_Riverside County', 'County_San Mateo County']
X_train3 = X_train2[selected_features]
logit4 = sm.Logit(y_train, X_train3.astype(float))
lg4 = logit4.fit(disp=False)
print(lg4.summary())
Logit Regression Results
==============================================================================
Dep. Variable: Personal_Loan No. Observations: 3500
Model: Logit Df Residuals: 3486
Method: MLE Df Model: 13
Date: Fri, 30 Jul 2021 Pseudo R-squ.: 0.6628
Time: 23:15:10 Log-Likelihood: -369.38
converged: True LL-Null: -1095.5
Covariance Type: nonrobust LLR p-value: 8.803e-303
===========================================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------------
const -13.4751 0.719 -18.749 0.000 -14.884 -12.067
Income 0.0650 0.004 16.509 0.000 0.057 0.073
CCAvg 0.2347 0.058 4.023 0.000 0.120 0.349
Family_3 2.5704 0.285 9.007 0.000 2.011 3.130
Family_4 1.6875 0.277 6.099 0.000 1.145 2.230
Education_2 4.0424 0.350 11.549 0.000 3.356 4.728
Education_3 4.2778 0.348 12.282 0.000 3.595 4.960
Securities_Account_1 -1.0632 0.412 -2.578 0.010 -1.872 -0.255
CD_Account_1 3.8098 0.446 8.538 0.000 2.935 4.684
Online_1 -0.6407 0.208 -3.077 0.002 -1.049 -0.233
CreditCard_1 -1.0700 0.276 -3.871 0.000 -1.612 -0.528
County_Kern County 1.4700 0.729 2.016 0.044 0.041 2.899
County_Riverside County 2.0314 0.771 2.633 0.008 0.520 3.543
County_San Mateo County -1.3916 0.592 -2.351 0.019 -2.552 -0.231
===========================================================================================
Possibly complete quasi-separation: A fraction 0.15 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
Now no feature has p-value greater than 0.05, so we'll consider the features in X_train3 as the final ones and lg4 as final model.
Coefficient of some levels of Education, Family, Income, CCAvg and CD_Account are positive. An increase in these will lead to increase in chances of a person purchasing a loan.
Coefficient of some levels of Securities_Account, Online and CreditCard are negative. An increase in these will lead to decrease in chances of a person purchasing a loan.
# converting coefficients to odds
odds = np.exp(lg4.params)
# finding the percentage change
perc_change_odds = (np.exp(lg4.params) - 1) * 100
# removing limit from number of columns to display
pd.set_option("display.max_columns", None)
# adding the odds to a dataframe
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train3.columns).T
| const | Income | CCAvg | Family_3 | Family_4 | Education_2 | Education_3 | Securities_Account_1 | CD_Account_1 | Online_1 | CreditCard_1 | County_Kern County | County_Riverside County | County_San Mateo County | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Odds | 0.000001 | 1.067190 | 1.264466 | 13.071537 | 5.405919 | 56.964651 | 72.081422 | 0.345344 | 45.143460 | 0.526902 | 0.343009 | 4.349088 | 7.625083 | 0.248675 |
| Change_odd% | -99.999859 | 6.718953 | 26.446633 | 1207.153662 | 440.591889 | 5596.465145 | 7108.142195 | -65.465638 | 4414.346049 | -47.309796 | -65.699129 | 334.908753 | 662.508317 | -75.132521 |
Income: Holding all other features constant a 1 unit change in Income will increase the odds of a customer purchasing a loan by 1.07 times or a 6.72% increase in odds of purchasing a loan.CCAvg: Holding all other features constant a 1 unit change in CCAvg will increase the odds of a customer purchasing a loan by 1.26 times or a 26.4% increase in odds of purchasing a loan.Interpretation for other attributes can be done similarly.
# creating confusion matrix
confusion_matrix_statsmodels(lg4, X_train3, y_train)
log_reg_model_train_perf = model_performance_classification_statsmodels(
lg4, X_train3, y_train
)
print("Training performance:")
log_reg_model_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.962857 | 0.700906 | 0.882129 | 0.781145 |
logit_roc_auc_train = roc_auc_score(y_train, lg4.predict(X_train3))
fpr, tpr, thresholds = roc_curve(y_train, lg4.predict(X_train3))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg4.predict(X_train3))
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.13451091873608728
# creating confusion matrix
confusion_matrix_statsmodels(
lg4, X_train3, y_train, threshold=optimal_threshold_auc_roc
)
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = (
model_performance_classification_statsmodels(
lg4, X_train3, y_train, threshold=optimal_threshold_auc_roc
)
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.93 | 0.882175 | 0.586345 | 0.704463 |
y_scores = lg4.predict(X_train3)
prec, rec, tre = precision_recall_curve(
y_train,
y_scores,
)
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
plt.plot(thresholds, precisions[:-1], "b--", label="precision")
plt.plot(thresholds, recalls[:-1], "g--", label="recall")
plt.xlabel("Threshold")
plt.legend(loc="upper left")
plt.ylim([0, 1])
plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
# setting the threshold
optimal_threshold_curve = 0.33
# creating confusion matrix
confusion_matrix_statsmodels(lg4, X_train3, y_train, threshold=optimal_threshold_curve)
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
lg4, X_train3, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.959714 | 0.785498 | 0.787879 | 0.786687 |
# training performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression sklearn",
"Logistic Regression-0.13 Threshold",
"Logistic Regression-0.33 Threshold",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Logistic Regression sklearn | Logistic Regression-0.13 Threshold | Logistic Regression-0.33 Threshold | |
|---|---|---|---|
| Accuracy | 0.962857 | 0.930000 | 0.959714 |
| Recall | 0.700906 | 0.882175 | 0.785498 |
| Precision | 0.882129 | 0.586345 | 0.787879 |
| F1 | 0.781145 | 0.704463 | 0.786687 |
Dropping the columns from the test set that were dropped from the training set
X_test3 = X_test[list(X_train3.columns)]
Using model with default threshold
# creating confusion matrix
confusion_matrix_statsmodels(lg4, X_test3, y_test)
log_reg_model_test_perf = model_performance_classification_statsmodels(
lg4, X_test3, y_test
)
print("Test performance:")
log_reg_model_test_perf
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.957333 | 0.644295 | 0.897196 | 0.75 |
logit_roc_auc_train = roc_auc_score(y_test, lg4.predict(X_test3))
fpr, tpr, thresholds = roc_curve(y_test, lg4.predict(X_test3))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
Using model with threshold=0.13
# creating confusion matrix
confusion_matrix_statsmodels(lg4, X_test3, y_test, threshold=optimal_threshold_auc_roc)
# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = (
model_performance_classification_statsmodels(
lg4, X_test3, y_test, threshold=optimal_threshold_auc_roc
)
)
print("Test performance:")
log_reg_model_test_perf_threshold_auc_roc
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.926 | 0.832215 | 0.590476 | 0.690808 |
Using model with threshold = 0.33
# creating confusion matrix
confusion_matrix_statsmodels(lg4, X_test3, y_test, threshold=optimal_threshold_curve)
log_reg_model_test_perf_threshold_curve = model_performance_classification_statsmodels(
lg4, X_test3, y_test, threshold=optimal_threshold_curve
)
print("Test performance:")
log_reg_model_test_perf_threshold_curve
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.957333 | 0.731544 | 0.819549 | 0.77305 |
# training performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression sklearn",
"Logistic Regression-0.13 Threshold",
"Logistic Regression-0.33 Threshold",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Logistic Regression sklearn | Logistic Regression-0.13 Threshold | Logistic Regression-0.33 Threshold | |
|---|---|---|---|
| Accuracy | 0.962857 | 0.930000 | 0.959714 |
| Recall | 0.700906 | 0.882175 | 0.785498 |
| Precision | 0.882129 | 0.586345 | 0.787879 |
| F1 | 0.781145 | 0.704463 | 0.786687 |
# testing performance comparison
models_test_comp_df = pd.concat(
[
log_reg_model_test_perf.T,
log_reg_model_test_perf_threshold_auc_roc.T,
log_reg_model_test_perf_threshold_curve.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Logistic Regression sklearn",
"Logistic Regression-0.13 Threshold",
"Logistic Regression-0.33 Threshold",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
| Logistic Regression sklearn | Logistic Regression-0.13 Threshold | Logistic Regression-0.33 Threshold | |
|---|---|---|---|
| Accuracy | 0.957333 | 0.926000 | 0.957333 |
| Recall | 0.644295 | 0.832215 | 0.731544 |
| Precision | 0.897196 | 0.590476 | 0.819549 |
| F1 | 0.750000 | 0.690808 | 0.773050 |
X = df.drop(["Personal_Loan", "ID"], axis=1) # Features
y = df["Personal_Loan"] # Labels (Target Variable)
# converting target to integers - since some functions might not work with bool type
X = pd.get_dummies(X, drop_first=True)
X.head()
| Age | Experience | Income | ZIPCode | CCAvg | Mortgage | Family_2 | Family_3 | Family_4 | Education_2 | Education_3 | Securities_Account_1 | CD_Account_1 | Online_1 | CreditCard_1 | County_Butte County | County_Contra Costa County | County_El Dorado County | County_Fresno County | County_Humboldt County | County_Imperial County | County_Kern County | County_Lake County | County_Los Angeles County | County_Marin County | County_Mendocino County | County_Merced County | County_Monterey County | County_Napa County | County_Orange County | County_Placer County | County_Riverside County | County_Sacramento County | County_San Benito County | County_San Bernardino County | County_San Diego County | County_San Francisco County | County_San Joaquin County | County_San Luis Obispo County | County_San Mateo County | County_Santa Barbara County | County_Santa Clara County | County_Santa Cruz County | County_Shasta County | County_Siskiyou County | County_Solano County | County_Sonoma County | County_Stanislaus County | County_Trinity County | County_Tuolumne County | County_Unknown | County_Ventura County | County_Yolo County | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49 | 91107 | 1.6 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 45 | 19 | 34 | 90089 | 1.5 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 39 | 15 | 11 | 94720 | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 35 | 9 | 100 | 94112 | 2.7 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 35 | 8 | 45 | 91330 | 1.0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
feature_names = list(X)
print(feature_names)
['Age', 'Experience', 'Income', 'ZIPCode', 'CCAvg', 'Mortgage', 'Family_2', 'Family_3', 'Family_4', 'Education_2', 'Education_3', 'Securities_Account_1', 'CD_Account_1', 'Online_1', 'CreditCard_1', 'County_Butte County', 'County_Contra Costa County', 'County_El Dorado County', 'County_Fresno County', 'County_Humboldt County', 'County_Imperial County', 'County_Kern County', 'County_Lake County', 'County_Los Angeles County', 'County_Marin County', 'County_Mendocino County', 'County_Merced County', 'County_Monterey County', 'County_Napa County', 'County_Orange County', 'County_Placer County', 'County_Riverside County', 'County_Sacramento County', 'County_San Benito County', 'County_San Bernardino County', 'County_San Diego County', 'County_San Francisco County', 'County_San Joaquin County', 'County_San Luis Obispo County', 'County_San Mateo County', 'County_Santa Barbara County', 'County_Santa Clara County', 'County_Santa Cruz County', 'County_Shasta County', 'County_Siskiyou County', 'County_Solano County', 'County_Sonoma County', 'County_Stanislaus County', 'County_Trinity County', 'County_Tuolumne County', 'County_Unknown', 'County_Ventura County', 'County_Yolo County']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print(X_train.shape, X_test.shape)
(3500, 53) (1500, 53)
If the frequency of class A is 9% and the frequency of class B is 91%, then class B will become the dominant class and the decision tree will become biased toward the dominant classes.
In this case, we can pass a dictionary {0:0.09,1:0.91} to the model to specify the weight of each class and the decision tree will give more weightage to class 1.
class_weight is a hyperparameter for the decision tree classifier.
model = DecisionTreeClassifier(
criterion="gini", class_weight={0: 0.09, 1: 0.91}, random_state=1
)
model.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.09, 1: 0.91}, random_state=1)
def make_confusion_matrix(model, y_actual, labels=[1, 0]):
"""
model : classifier to predict values of X
y_actual : ground truth
"""
y_predict = model.predict(X_test)
cm = metrics.confusion_matrix(y_actual, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(
cm,
index=[i for i in ["Actual - No", "Actual - Yes"]],
columns=[i for i in ["Predicted - No", "Predicted - Yes"]],
)
group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2, 2)
plt.figure(figsize=(10, 7))
sns.heatmap(df_cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
make_confusion_matrix(model, y_test)
y_train.value_counts(1)
0 0.905429 1 0.094571 Name: Personal_Loan, dtype: float64
We only have 9% of positive classes, so if our model marks each sample as negative, then also we'll get 90% accuracy, hence accuracy is not a good metric to evaluate here.
True Positives:
True Negatives:
False Positives:
False Negatives:
## Function to calculate recall score
def get_recall_score(model):
"""
model : classifier to predict values of X
"""
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)
print("Recall on training set : ", metrics.recall_score(y_train, pred_train))
print("Recall on test set : ", metrics.recall_score(y_test, pred_test))
get_recall_score(model)
Recall on training set : 1.0 Recall on test set : 0.8859060402684564
plt.figure(figsize=(20,30))
out = tree.plot_tree(model,feature_names=feature_names,filled=True,fontsize=9,node_ids=False,class_names=None,)
#below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor('black')
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(model, feature_names=feature_names, show_weights=True))
|--- Income <= 92.50 | |--- CCAvg <= 2.95 | | |--- weights: [219.15, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- CD_Account_1 <= 0.50 | | | |--- CCAvg <= 3.95 | | | | |--- Mortgage <= 102.50 | | | | | |--- Income <= 68.50 | | | | | | |--- weights: [1.35, 0.00] class: 0 | | | | | |--- Income > 68.50 | | | | | | |--- CCAvg <= 3.05 | | | | | | | |--- weights: [0.99, 0.00] class: 0 | | | | | | |--- CCAvg > 3.05 | | | | | | | |--- Family_4 <= 0.50 | | | | | | | | |--- ZIPCode <= 94714.50 | | | | | | | | | |--- ZIPCode <= 90437.50 | | | | | | | | | | |--- weights: [0.27, 0.00] class: 0 | | | | | | | | | |--- ZIPCode > 90437.50 | | | | | | | | | | |--- County_San Diego County <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | | |--- County_San Diego County > 0.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | |--- ZIPCode > 94714.50 | | | | | | | | | |--- weights: [0.54, 0.00] class: 0 | | | | | | | |--- Family_4 > 0.50 | | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | | | |--- Mortgage > 102.50 | | | | | |--- weights: [1.89, 0.00] class: 0 | | | |--- CCAvg > 3.95 | | | | |--- weights: [3.78, 0.00] class: 0 | | |--- CD_Account_1 > 0.50 | | | |--- weights: [0.00, 4.55] class: 1 |--- Income > 92.50 | |--- Education_3 <= 0.50 | | |--- Education_2 <= 0.50 | | | |--- Family_3 <= 0.50 | | | | |--- Family_4 <= 0.50 | | | | | |--- Income <= 103.50 | | | | | | |--- CCAvg <= 3.21 | | | | | | | |--- weights: [3.60, 0.00] class: 0 | | | | | | |--- CCAvg > 3.21 | | | | | | | |--- ZIPCode <= 91485.50 | | | | | | | | |--- weights: [0.00, 2.73] class: 1 | | | | | | | |--- ZIPCode > 91485.50 | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | |--- Income > 103.50 | | | | | | |--- County_Stanislaus County <= 0.50 | | | | | | | |--- weights: [38.88, 0.00] class: 0 | | | | | | |--- County_Stanislaus County > 0.50 | | | | | | | |--- weights: [0.09, 0.00] class: 0 | | | | |--- Family_4 > 0.50 | | | | | |--- Income <= 93.50 | | | | | | |--- weights: [0.09, 0.00] class: 0 | | | | | |--- Income > 93.50 | | | | | | |--- Income <= 102.00 | | | | | | | |--- County_Los Angeles County <= 0.50 | | | | | | | | |--- weights: [0.00, 0.91] class: 1 | | | | | | | |--- County_Los Angeles County > 0.50 | | | | | | | | |--- weights: [0.09, 0.00] class: 0 | | | | | | |--- Income > 102.00 | | | | | | | |--- CD_Account_1 <= 0.50 | | | | | | | | |--- weights: [0.00, 10.01] class: 1 | | | | | | | |--- CD_Account_1 > 0.50 | | | | | | | | |--- weights: [0.00, 7.28] class: 1 | | | |--- Family_3 > 0.50 | | | | |--- Income <= 108.50 | | | | | |--- weights: [0.99, 0.00] class: 0 | | | | |--- Income > 108.50 | | | | | |--- Age <= 26.00 | | | | | | |--- weights: [0.09, 0.00] class: 0 | | | | | |--- Age > 26.00 | | | | | | |--- ZIPCode <= 90019.50 | | | | | | | |--- weights: [0.09, 0.00] class: 0 | | | | | | |--- ZIPCode > 90019.50 | | | | | | | |--- County_Santa Barbara County <= 0.50 | | | | | | | | |--- weights: [0.00, 30.94] class: 1 | | | | | | | |--- County_Santa Barbara County > 0.50 | | | | | | | | |--- Online_1 <= 0.50 | | | | | | | | | |--- weights: [0.00, 0.91] class: 1 | | | | | | | | |--- Online_1 > 0.50 | | | | | | | | | |--- weights: [0.09, 0.00] class: 0 | | |--- Education_2 > 0.50 | | | |--- Income <= 110.50 | | | | |--- CCAvg <= 2.90 | | | | | |--- County_San Francisco County <= 0.50 | | | | | | |--- weights: [3.87, 0.00] class: 0 | | | | | |--- County_San Francisco County > 0.50 | | | | | | |--- Online_1 <= 0.50 | | | | | | | |--- CCAvg <= 1.40 | | | | | | | | |--- weights: [0.09, 0.00] class: 0 | | | | | | | |--- CCAvg > 1.40 | | | | | | | | |--- weights: [0.18, 0.00] class: 0 | | | | | | |--- Online_1 > 0.50 | | | | | | | |--- CreditCard_1 <= 0.50 | | | | | | | | |--- weights: [0.00, 0.91] class: 1 | | | | | | | |--- CreditCard_1 > 0.50 | | | | | | | | |--- weights: [0.09, 0.00] class: 0 | | | | |--- CCAvg > 2.90 | | | | | |--- ZIPCode <= 95083.00 | | | | | | |--- Family_2 <= 0.50 | | | | | | | |--- weights: [0.00, 3.64] class: 1 | | | | | | |--- Family_2 > 0.50 | | | | | | | |--- weights: [0.00, 1.82] class: 1 | | | | | |--- ZIPCode > 95083.00 | | | | | | |--- weights: [0.18, 0.00] class: 0 | | | |--- Income > 110.50 | | | | |--- Income <= 116.50 | | | | | |--- Mortgage <= 141.50 | | | | | | |--- Age <= 60.50 | | | | | | | |--- CCAvg <= 1.20 | | | | | | | | |--- weights: [0.18, 0.00] class: 0 | | | | | | | |--- CCAvg > 1.20 | | | | | | | | |--- ZIPCode <= 94887.00 | | | | | | | | | |--- ZIPCode <= 91520.50 | | | | | | | | | | |--- Experience <= 12.50 | | | | | | | | | | | |--- weights: [0.00, 1.82] class: 1 | | | | | | | | | | |--- Experience > 12.50 | | | | | | | | | | | |--- weights: [0.18, 0.00] class: 0 | | | | | | | | | |--- ZIPCode > 91520.50 | | | | | | | | | | |--- weights: [0.00, 4.55] class: 1 | | | | | | | | |--- ZIPCode > 94887.00 | | | | | | | | | |--- weights: [0.09, 0.00] class: 0 | | | | | | |--- Age > 60.50 | | | | | | | |--- weights: [0.27, 0.00] class: 0 | | | | | |--- Mortgage > 141.50 | | | | | | |--- weights: [0.36, 0.00] class: 0 | | | | |--- Income > 116.50 | | | | | |--- Online_1 <= 0.50 | | | | | | |--- weights: [0.00, 31.85] class: 1 | | | | | |--- Online_1 > 0.50 | | | | | | |--- weights: [0.00, 66.43] class: 1 | |--- Education_3 > 0.50 | | |--- Income <= 116.50 | | | |--- CCAvg <= 2.35 | | | | |--- Mortgage <= 236.00 | | | | | |--- weights: [3.60, 0.00] class: 0 | | | | |--- Mortgage > 236.00 | | | | | |--- CCAvg <= 1.25 | | | | | | |--- Experience <= 14.00 | | | | | | | |--- weights: [0.09, 0.00] class: 0 | | | | | | |--- Experience > 14.00 | | | | | | | |--- weights: [0.00, 1.82] class: 1 | | | | | |--- CCAvg > 1.25 | | | | | | |--- weights: [0.36, 0.00] class: 0 | | | |--- CCAvg > 2.35 | | | | |--- Age <= 64.00 | | | | | |--- County_Santa Barbara County <= 0.50 | | | | | | |--- ZIPCode <= 90389.50 | | | | | | | |--- weights: [0.18, 0.00] class: 0 | | | | | | |--- ZIPCode > 90389.50 | | | | | | | |--- County_Yolo County <= 0.50 | | | | | | | | |--- County_Orange County <= 0.50 | | | | | | | | | |--- County_San Bernardino County <= 0.50 | | | | | | | | | | |--- Age <= 42.00 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | | |--- Age > 42.00 | | | | | | | | | | | |--- weights: [0.00, 11.83] class: 1 | | | | | | | | | |--- County_San Bernardino County > 0.50 | | | | | | | | | | |--- weights: [0.09, 0.00] class: 0 | | | | | | | | |--- County_Orange County > 0.50 | | | | | | | | | |--- weights: [0.09, 0.00] class: 0 | | | | | | | |--- County_Yolo County > 0.50 | | | | | | | | |--- weights: [0.09, 0.00] class: 0 | | | | | |--- County_Santa Barbara County > 0.50 | | | | | | |--- weights: [0.18, 0.00] class: 0 | | | | |--- Age > 64.00 | | | | | |--- weights: [0.27, 0.00] class: 0 | | |--- Income > 116.50 | | | |--- Mortgage <= 37.50 | | | | |--- weights: [0.00, 64.61] class: 1 | | | |--- Mortgage > 37.50 | | | | |--- weights: [0.00, 39.13] class: 1
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
model.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
Imp Income 6.304534e-01 CCAvg 9.474637e-02 Family_4 8.158651e-02 Education_2 6.084904e-02 Family_3 6.021242e-02 Education_3 1.997043e-02 Mortgage 1.282500e-02 ZIPCode 1.105999e-02 CD_Account_1 7.255481e-03 Age 6.087707e-03 County_San Francisco County 3.351209e-03 Experience 2.273227e-03 Online_1 1.760819e-03 County_Monterey County 1.162319e-03 County_Orange County 1.160398e-03 County_Santa Barbara County 1.138337e-03 County_San Diego County 6.679308e-04 Family_2 5.854033e-04 County_Contra Costa County 5.819546e-04 County_Los Angeles County 5.813309e-04 County_San Bernardino County 5.712004e-04 County_Yolo County 5.604598e-04 CreditCard_1 5.590602e-04 County_Stanislaus County 7.383381e-16 County_Lake County 0.000000e+00 County_San Luis Obispo County 0.000000e+00 County_Ventura County 0.000000e+00 County_Unknown 0.000000e+00 County_Tuolumne County 0.000000e+00 County_Trinity County 0.000000e+00 County_Sonoma County 0.000000e+00 County_Solano County 0.000000e+00 County_Siskiyou County 0.000000e+00 County_Shasta County 0.000000e+00 County_Santa Cruz County 0.000000e+00 County_Santa Clara County 0.000000e+00 Securities_Account_1 0.000000e+00 County_San Mateo County 0.000000e+00 County_San Joaquin County 0.000000e+00 County_Kern County 0.000000e+00 County_Butte County 0.000000e+00 County_El Dorado County 0.000000e+00 County_San Benito County 0.000000e+00 County_Sacramento County 0.000000e+00 County_Riverside County 0.000000e+00 County_Placer County 0.000000e+00 County_Fresno County 0.000000e+00 County_Napa County 0.000000e+00 County_Humboldt County 0.000000e+00 County_Imperial County 0.000000e+00 County_Mendocino County 0.000000e+00 County_Marin County 0.000000e+00 County_Merced County 0.000000e+00
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
The tree above is very complex and difficult to interpret.
from sklearn.model_selection import GridSearchCV
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight={0: 0.09, 1: 0.91})
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(1, 10),
"criterion": ["entropy", "gini"],
"splitter": ["best", "random"],
"min_impurity_decrease": [0.000001, 0.00001, 0.0001],
"max_features": ["log2", "sqrt"],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.09, 1: 0.91}, criterion='entropy',
max_depth=1, max_features='log2',
min_impurity_decrease=1e-06, random_state=1,
splitter='random')
make_confusion_matrix(estimator, y_test)
get_recall_score(estimator)
Recall on training set : 1.0 Recall on test set : 0.9865771812080537
Recall has improved for both train and test set after hyperparameter tuning, and we have got a generalized model.
plt.figure(figsize=(15, 10))
out = tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- County_Butte County <= 0.86 | |--- weights: [284.04, 301.21] class: 1 |--- County_Butte County > 0.86 | |--- weights: [1.17, 0.00] class: 0
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
estimator.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
# Here we will see that importance of features has increased
Imp County_Butte County 1.0 Age 0.0 County_San Mateo County 0.0 County_Orange County 0.0 County_Placer County 0.0 County_Riverside County 0.0 County_Sacramento County 0.0 County_San Benito County 0.0 County_San Bernardino County 0.0 County_San Diego County 0.0 County_San Francisco County 0.0 County_San Joaquin County 0.0 County_San Luis Obispo County 0.0 County_Santa Barbara County 0.0 County_Monterey County 0.0 County_Santa Clara County 0.0 County_Santa Cruz County 0.0 County_Shasta County 0.0 County_Siskiyou County 0.0 County_Solano County 0.0 County_Sonoma County 0.0 County_Stanislaus County 0.0 County_Trinity County 0.0 County_Tuolumne County 0.0 County_Unknown 0.0 County_Ventura County 0.0 County_Napa County 0.0 County_Merced County 0.0 Experience 0.0 CD_Account_1 0.0 Income 0.0 ZIPCode 0.0 CCAvg 0.0 Mortgage 0.0 Family_2 0.0 Family_3 0.0 Family_4 0.0 Education_2 0.0 Education_3 0.0 Securities_Account_1 0.0 Online_1 0.0 County_Mendocino County 0.0 CreditCard_1 0.0 County_Contra Costa County 0.0 County_El Dorado County 0.0 County_Fresno County 0.0 County_Humboldt County 0.0 County_Imperial County 0.0 County_Kern County 0.0 County_Lake County 0.0 County_Los Angeles County 0.0 County_Marin County 0.0 County_Yolo County 0.0
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
The DecisionTreeClassifier provides parameters such as
min_samples_leaf and max_depth to prevent a tree from overfitting. Cost
complexity pruning provides another option to control the size of a tree. In
DecisionTreeClassifier, this pruning technique is parameterized by the
cost complexity parameter, ccp_alpha. Greater values of ccp_alpha
increase the number of nodes pruned. Here we only show the effect of
ccp_alpha on regularizing the trees and how to choose a ccp_alpha
based on validation scores.
Minimal cost complexity pruning recursively finds the node with the "weakest
link". The weakest link is characterized by an effective alpha, where the
nodes with the smallest effective alpha are pruned first. To get an idea of
what values of ccp_alpha could be appropriate, scikit-learn provides
DecisionTreeClassifier.cost_complexity_pruning_path function that returns the
effective alphas and the corresponding total leaf impurities at each step of
the pruning process. As alpha increases, more of the tree is pruned, which
increases the total impurity of its leaves.
clf = DecisionTreeClassifier(random_state=1, class_weight={0: 0.09, 1: 0.91})
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
| ccp_alphas | impurities | |
|---|---|---|
| 0 | 0.000000e+00 | -3.377129e-17 |
| 1 | 4.600529e-19 | -3.331123e-17 |
| 2 | 1.192730e-18 | -3.211850e-17 |
| 3 | 3.101097e-18 | -2.901741e-17 |
| 4 | 3.101097e-18 | -2.591631e-17 |
| 5 | 9.820141e-18 | -1.609617e-17 |
| 6 | 9.303291e-17 | 7.693674e-17 |
| 7 | 3.688942e-16 | 4.458310e-16 |
| 8 | 5.695682e-16 | 1.015399e-15 |
| 9 | 1.513354e-04 | 3.026709e-04 |
| 10 | 1.518054e-04 | 6.062817e-04 |
| 11 | 1.527184e-04 | 9.117186e-04 |
| 12 | 1.530412e-04 | 1.217801e-03 |
| 13 | 2.793220e-04 | 1.497123e-03 |
| 14 | 2.908272e-04 | 2.369605e-03 |
| 15 | 2.924838e-04 | 2.662088e-03 |
| 16 | 2.944660e-04 | 3.545486e-03 |
| 17 | 3.024456e-04 | 3.847932e-03 |
| 18 | 3.035094e-04 | 4.454951e-03 |
| 19 | 4.370278e-04 | 7.077117e-03 |
| 20 | 5.470919e-04 | 8.171301e-03 |
| 21 | 5.500728e-04 | 8.721374e-03 |
| 22 | 5.943021e-04 | 9.315676e-03 |
| 23 | 6.004323e-04 | 9.916108e-03 |
| 24 | 6.803179e-04 | 1.127674e-02 |
| 25 | 7.374239e-04 | 1.348902e-02 |
| 26 | 7.558687e-04 | 1.424488e-02 |
| 27 | 7.944340e-04 | 1.503932e-02 |
| 28 | 9.380096e-04 | 1.597733e-02 |
| 29 | 1.317557e-03 | 1.729488e-02 |
| 30 | 1.397511e-03 | 1.869240e-02 |
| 31 | 1.674357e-03 | 2.036675e-02 |
| 32 | 1.974358e-03 | 2.431547e-02 |
| 33 | 2.165894e-03 | 2.648136e-02 |
| 34 | 2.412410e-03 | 2.889377e-02 |
| 35 | 3.052127e-03 | 3.194590e-02 |
| 36 | 3.220628e-03 | 3.516653e-02 |
| 37 | 3.431202e-03 | 3.859773e-02 |
| 38 | 3.625040e-03 | 4.222277e-02 |
| 39 | 3.718793e-03 | 4.966035e-02 |
| 40 | 4.556814e-03 | 5.877398e-02 |
| 41 | 5.523677e-03 | 6.982134e-02 |
| 42 | 2.366879e-02 | 9.349013e-02 |
| 43 | 2.729645e-02 | 2.026759e-01 |
| 44 | 2.969519e-01 | 4.996278e-01 |
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train a decision tree using the effective alphas. The last value
in ccp_alphas is the alpha value that prunes the whole tree,
leaving the tree, clfs[-1], with one node.
clfs = []
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(
random_state=1, ccp_alpha=ccp_alpha, class_weight={0: 0.09, 1: 0.91}
)
clf.fit(X_train, y_train)
clfs.append(clf)
print(
"Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.2969518641725174
For the remainder, we remove the last element in
clfs and ccp_alphas, because it is the trivial tree with only one
node. Here we show that the number of nodes and tree depth decreases as alpha
increases.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1,figsize=(10,7))
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
recall_train = []
for clf in clfs:
pred_train3 = clf.predict(X_train)
values_train = metrics.recall_score(y_train, pred_train3)
recall_train.append(values_train)
recall_test = []
for clf in clfs:
pred_test3 = clf.predict(X_test)
values_test = metrics.recall_score(y_test, pred_test3)
recall_test.append(values_test)
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(
ccp_alphas,
recall_train,
marker="o",
label="train",
drawstyle="steps-post",
)
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
Maximum value of Recall is at 0.004 alpha.
# creating the model where we get highest train and test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.004556813877919528,
class_weight={0: 0.09, 1: 0.91}, random_state=1)
best_model.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.004556813877919528,
class_weight={0: 0.09, 1: 0.91}, random_state=1)
make_confusion_matrix(best_model, y_test)
get_recall_score(best_model)
Recall on training set : 0.9879154078549849 Recall on test set : 0.9798657718120806
plt.figure(figsize=(5, 5))
out = tree.plot_tree(
best_model,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- Income <= 92.50 | |--- CCAvg <= 2.95 | | |--- weights: [219.15, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- weights: [10.53, 13.65] class: 1 |--- Income > 92.50 | |--- Education_3 <= 0.50 | | |--- Education_2 <= 0.50 | | | |--- Family_3 <= 0.50 | | | | |--- Family_4 <= 0.50 | | | | | |--- weights: [43.02, 2.73] class: 0 | | | | |--- Family_4 > 0.50 | | | | | |--- weights: [0.18, 18.20] class: 1 | | | |--- Family_3 > 0.50 | | | | |--- weights: [1.26, 31.85] class: 1 | | |--- Education_2 > 0.50 | | | |--- Income <= 110.50 | | | | |--- CCAvg <= 2.90 | | | | | |--- weights: [4.23, 0.91] class: 0 | | | | |--- CCAvg > 2.90 | | | | | |--- weights: [0.18, 5.46] class: 1 | | | |--- Income > 110.50 | | | | |--- weights: [1.08, 104.65] class: 1 | |--- Education_3 > 0.50 | | |--- weights: [5.58, 123.76] class: 1
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
best_model.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
Imp Income 0.685624 Family_4 0.087835 Education_2 0.068961 Family_3 0.068240 CCAvg 0.066707 Education_3 0.022633 County_San Diego County 0.000000 County_San Mateo County 0.000000 County_San Luis Obispo County 0.000000 County_San Joaquin County 0.000000 County_San Francisco County 0.000000 Age 0.000000 County_Santa Clara County 0.000000 County_San Bernardino County 0.000000 County_San Benito County 0.000000 County_Sacramento County 0.000000 County_Santa Barbara County 0.000000 County_Solano County 0.000000 County_Santa Cruz County 0.000000 County_Shasta County 0.000000 County_Siskiyou County 0.000000 County_Placer County 0.000000 County_Sonoma County 0.000000 County_Stanislaus County 0.000000 County_Trinity County 0.000000 County_Tuolumne County 0.000000 County_Unknown 0.000000 County_Ventura County 0.000000 County_Riverside County 0.000000 County_Merced County 0.000000 County_Orange County 0.000000 County_El Dorado County 0.000000 ZIPCode 0.000000 Mortgage 0.000000 Family_2 0.000000 Securities_Account_1 0.000000 CD_Account_1 0.000000 Online_1 0.000000 CreditCard_1 0.000000 County_Butte County 0.000000 County_Contra Costa County 0.000000 County_Fresno County 0.000000 County_Napa County 0.000000 County_Humboldt County 0.000000 County_Imperial County 0.000000 County_Kern County 0.000000 County_Lake County 0.000000 County_Los Angeles County 0.000000 County_Marin County 0.000000 County_Mendocino County 0.000000 Experience 0.000000 County_Monterey County 0.000000 County_Yolo County 0.000000
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
comparison_frame = pd.DataFrame(
{
"Model": [
"Initial decision tree model",
"Decision tree with hyperparameter tuning",
"Decision tree with post-pruning",
],
"Train_Recall": [1, 1, 0.99],
"Test_Recall": [0.88, 0.97, 0.98],
}
)
comparison_frame
| Model | Train_Recall | Test_Recall | |
|---|---|---|---|
| 0 | Initial decision tree model | 1.00 | 0.88 |
| 1 | Decision tree with hyperparameter tuning | 1.00 | 0.97 |
| 2 | Decision tree with post-pruning | 0.99 | 0.98 |
Decision tree model with post pruning has given the best recall score on the test data.
We have been able to build a Logistic Regression model that can be used by the bank to predict if customers will purchase a personal loan in the new campaign with a recall score of 0.88 on the training set and 0.83 on the test set.
Coefficient of some levels of Education, Family, Income, CCAvg and CD_Account are positive. An increase in these will lead to increase in chances of a person purchasing a loan.
Coefficient of some levels of Securities_Account, Online and CreditCard are negative. An increase in these will lead to decrease in chances of a person purchasing a loan.
We also used Decision Tree Classifier to build a predictive model for the same. To determine if a customer will purchase a loan or not. We ended up with a Recall score of 0.99 on the training set and 0.98 on the test set.
We verified the fact that much less data preparation is needed for Decision Trees and such a simple model gave good results even with outliers and imbalanced classes which shows the robustness of Decision Trees.
Income , Family Size, Education level and Average spending on Credit cards are the most important features to predict customer's who will purchase a loan.
Train and Test scores are comparable for both Logistic Regression and Decision Tree models. Hence we can conclude the models are good for predictions as well as inference purposes.
Decision Tree with Post-Pruning is our preferred model which follows all the assumptions, and can be used for interpretations.
The model is able to reduce a higher percentage of False Negatives compared to the Logistic Regression model.
The retail marketing department should devise campaigns to target: